Abstract

Since  time-to-solution  and  processor  scalability  of  an 
application  can  vary  greatly  from  one  architecture  to 
another,  it  is  important  to  consider  the  suitability  of  each 
system  with  respect  to  that  application  in  order  to  make 
efficient  use  of  available  resources.  Given  that  the  full  set 
of  possible  applications  is  quite  large,  this  paper  focuses 
on those applications within the FY-04 technology
insertion (TI-04) benchmarking suite (AERO, COBALT-60,
GAMESS, HYCOM, NAMD, OOCore, and RFCTH), all of which
except AERO have both a standard and a large test
case. Both the overall performance and performance per
processor of the High Performance Computing
Modernization Program's (HPCMP's) major systems are
analyzed  for  each  test  case  in  order  to  provide  general 
selection  guidance  to  users  and  a unique  perspective  for 
code  developers. 

1.  Introduction 

Performance variation among architectures is not a
new topic, especially to those who were responsible in the
mid-1990s for converting vector codes to efficiently use more
commodity-like instruction set architectures (ISAs).
Now,  almost  ten  years  later,  vector  architectures  are  back. 
This  time  they  are  just  one  of  a multitude  of  architectures 
that  are  available  to  users,  making  the  mapping  of 
problems  to  systems  that  much  more  difficult.  So, 
assuming  a user  has  a particular  application  and  maybe 
even  a particular  problem  in  mind,  how  does  he  or  she 
decide  what  system  to  use?  Some  may  decide  to  continue 
to  use  a system  they  have  used  before,  but  the  lifespan  of 
that  system  is  limited  (to  typically  3.5  years).  Some  may 
decide  to  stay  with  a particular  vendor,  but  market  forces 
can  in  some  cases  cause  yesterday’s  winners  to  be 
tomorrow’s  losers.  Some  may  decide  to  remain  with  a 
general architecture, but the industry is in a continual
process of reinventing itself as outside influences such as
Government-sponsored research encourage vendors to
“think  outside  of  the  box”  and  “dream  big”.  So  it  is 
worthwhile  for  a user  to  re-examine  his  or  her  choice  on  a 
periodic  basis  (e.g.,  once  a year),  but  on  what  basis? 
Fortunately,  the  HPCMP  has  decided  as  a matter  of  policy 
to  assess  all  of  its  systems,  both  new  and  old,  using  its 
annually  updated  technology  insertion  (TI-XX) 
benchmarking  suite,  with  the  first  comprehensive 
assessment  being  performed  using  the  TI-04  suite  (i.e., 
the  suite  for  the  most  recently  completed  program 
acquisition).  Therefore,  a rich  set  of  data  is  available  to 
compare  existing  program  systems. 

The usefulness of the comparison will lie in how
well the user's application and problem (or test case)
maps  to  a particular  application  and  test  case  in  the 
benchmarking  suite.  As  a cursory  guide,  discipline 
associations  for  each  TI-04  benchmarking  code  are 
provided  below: 

CCM
    GAMESS - quantum chemistry
    NAMD - molecular dynamics

CEA
    OOCore - electromagnetics

CFD
    AERO - aeroelastic fluid/structure interactions
    COBALT-60 - general flow (Euler/Navier-Stokes)

CSM
    RFCTH - shock physics

CWO
    HYCOM - ocean modeling

Additional  code  descriptions  can  be  found  in  Tracy 
et  al.,  2003m. 

2.  Problem  and  Methodology 

Work  is  often  classified  into  one  of  two  types  - 
capability-oriented  and  capacity-oriented.  In  the  extreme 
case, capability-oriented problems are so large that all
processors  are  required  on  a well-balanced,  state-of-the- 
art  system  in  order  to  reduce  the  time-to-solution  to  a 
reasonable  fraction  of  the  system’s  mean-time-to-failure. 
For  such  problems,  the  total  capability  of  a system  is  of 
interest  and  is  determined  by  the  time-to-solution  for  the 
problem  when  using  all  processors.  Capacity-oriented 
problems, on the other hand, have reasonable times-to-solution
on a fraction of the same system, but often
require  a large  number  of  independent  executions  to  cover 
a host  of  scenarios.  In  that  case,  the  total  capacity  of  the 
system  is  of  interest  and  is  deduced  by  determining  how 
many  scenarios  that  system  can  execute  in  a given  amount 
of  time.  For  users  that  require  high  single  image 
capability  or  capacity,  performance  results  are  provided, 
while  for  users  that  require  moderate  single  image 
capability  and  capacity,  or  require  high  capacity  but  are 
willing  to  spread  work  across  a number  of  systems, 
performance  per  processor  results  are  provided. 
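
The distinction between the two metrics can be made concrete with a short sketch. The Python fragment below is illustrative only: the function names, the packing model (identical runs placed side by side, each with a fixed run time), and all numbers are assumptions and are not taken from the TI-04 data. Performance is assumed to be any score for which larger is better.

```python
def scenarios_completed(total_procs, procs_per_run, hours_per_run, window_hours):
    """Capacity estimate: independent runs finished within a time window.

    Assumes (hypothetically) that identical runs are packed side by side
    and that every run takes the same fixed wall-clock time.
    """
    concurrent_runs = total_procs // procs_per_run   # runs that fit at once
    rounds = int(window_hours // hours_per_run)      # back-to-back rounds
    return concurrent_runs * rounds


def performance_per_processor(performance, procs):
    """Performance density: a system's score divided by its processor count."""
    return performance / procs


if __name__ == "__main__":
    # Hypothetical 512-processor system, 64-processor runs of 6 hours each,
    # observed over a 24-hour window.
    print(scenarios_completed(512, 64, 6.0, 24.0))
    # Hypothetical score of 2.0 obtained on 64 processors.
    print(performance_per_processor(2.0, 64))
```

Under these assumptions, capacity scales with the number of runs that fit on the system at once, whereas performance per processor is independent of system size, which is why the latter is the more useful figure when work can be spread across several systems.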

Twenty-seven  major  unclassified  HPCMP  systems 
are  included  in  this  study,  three  of  which  are  systems  that 
were  purchased  in  the  TI-04  acquisition,  and  are  therefore 
identified  by  non-descriptive  monikers  (e.g.,  System  A). 
Actual  system  identifications  can  be  provided  along  with 
verified/updated  results,  once  installation  by  the 
respective  vendors  and  acceptance/benchmarking  by  the 
Government  for  these  systems  has  been  completed. 

3.  Results 

For  each  application  test  case,  the  major  unclassified 
HPCMP  systems  were  ranked  by  performance  (Table  1) 
and  performance  per  processor  (Table  2)  with  the  top  and 
bottom  five  systems  denoted  in  tan  and  red,  respectively. 
The  top  and  bottom  five  were  additionally  extracted  and 
displayed  by  architecture  in  Tables  3 and  4. 
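
The ranking procedure described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build Tables 1 through 4: the system names, scores, and processor counts are hypothetical, scores are assumed to be larger-is-better, and ties are broken simply by input order.

```python
def rank_systems(scores, n=5):
    """Return (full ranking, top n, bottom n), best system first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered, ordered[:n], ordered[-n:]


def per_processor(scores, procs):
    """Convert whole-system scores into performance-per-processor values."""
    return {name: scores[name] / procs[name] for name in scores}


if __name__ == "__main__":
    # Hypothetical scores for one test case, and hypothetical system sizes.
    scores = {"S1": 4.0, "S2": 1.0, "S3": 3.0, "S4": 2.0}
    procs = {"S1": 512, "S2": 32, "S3": 64, "S4": 128}

    # Ranking by overall performance (the basis of Table 1).
    print(rank_systems(scores, n=2))
    # Ranking by performance per processor (the basis of Table 2).
    print(rank_systems(per_processor(scores, procs), n=2))
```

In this made-up data set the largest system ranks first on overall performance but last on performance per processor, the same kind of reversal that the results below report for several large and small systems.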

General Performance - Systems B and C, the large
700MHz/p O3900s at ASC and ERDC, the large
1.3GHz/p P4 at NAVO, and the medium-sized 1.3GHz/p
P4 at ARSC were consistently top performers, while the
small 1.7GHz/p P4s at ARL and ARSC, the medium-sized
375MHz/p P3 at ASC, the small 833MHz/p SC40 at
ASC, and the large T3Es at ERDC and AHPCRC were
consistently poor performers. For smaller systems with
newer architectures, it was not surprising that poor
performance was observed, given the premise of the
comparison - overall capability. For a larger system of
the same type, the ranking would, no doubt, improve. The
400MHz/p X1s at ERDC, AHPCRC, and ARSC exhibited
bi-modal performance, with good marks on AERO,
Cobalt-60 Standard and Large, and HYCOM Standard,
and poor marks on GAMESS Standard and Large,
NAMD Standard and Large, and RFCTH Large. For
AERO, GAMESS, and NAMD, these results were not
surprising given that AERO is a vector code, and
GAMESS and NAMD do not vectorize well.

General Performance per Processor - Systems A
and B, the small 1.7GHz/p P4s at ARL and ARSC, the
small 3.06GHz/p Xeon cluster at ARL, and the 400MHz/p
X1s at ERDC, AHPCRC, and ARSC consistently
demonstrated good performance density, while the large
375MHz/p P3s at ARL and NAVO, the medium-sized
375MHz/p P3 at MHPCC, and the large T3Es at ERDC
and AHPCRC consistently demonstrated poor
performance density.

Additional  Performance  Notables 

• System A performed well for both test cases of
GAMESS and OOCore, but only moderately
well for the other test cases.

• Despite being a generally good performer,
System C performed poorly on AERO.

• The small 1.7GHz/p P4s at ARL and ARSC
performed moderately well on AERO, despite
being generally poor performers.

• The small 3.06GHz/p Xeon cluster at ARL
performed poorly on AERO and for both test
cases of Cobalt-60, but well for both test cases of
GAMESS.

• The medium-sized 400MHz/p O3800s at ARL
and ERDC performed well on the synthetic tests, but
poorly on all other tests.

• The small 1GHz/p SC45 at ASC performed
poorly for HYCOM Large and OOCore Large,
and not much better for the other test cases.

• The large T3E at ERDC performed well for
GAMESS Standard, despite being a generally
poor performer.

• The medium-sized 375MHz/p P3 at MHPCC
performed better than a smaller system with a
like architecture at ASC due to its additional size,
yet still performed poorly.

• The small 1.3GHz/p P4 at MHPCC performed
well on AERO and poorly on the synthetic tests.

Additional  Performance  Density  Notables 

• System B exhibited a poor performance density
(PD) for the synthetic tests despite generally
having a good PD.

• System C exhibited a good PD for both test cases
of GAMESS and HYCOM Standard, but a poor
PD for AERO and the synthetic tests.

• Despite having generally good PDs, the small
1.7GHz/p P4s at ARL and ARSC exhibited
relatively poor PDs for the synthetic tests.

• The medium-sized 400MHz/p O3800s at ARL
and ERDC exhibited good PDs for the synthetic
tests, but poor PDs for everything else.

• The small 833MHz/p SC40 at ASC exhibited a
good PD for both test cases of NAMD, but a
poor PD for both test cases of Cobalt-60 and
OOCore.

• The large 700MHz/p O3900s at ASC and ERDC
exhibited good PDs for the synthetic tests, but
poor PDs for everything else.

• Despite having generally good PDs, the small
400MHz/p X1s at ERDC, AHPCRC, and ARSC
exhibited poor PDs for both test cases of
NAMD.

• The medium-sized 833MHz/p SC40 at ERDC
exhibited a poor PD for both test cases of
OOCore, but a decent PD for both test cases of
NAMD and the synthetic tests.

• The small 1.3GHz/p P4 at MHPCC exhibited a
good PD for NAMD Large and a poor PD for the
synthetic tests.

4.  Significance  to  DoD 

The  mapping  of  problems  to  resources  significantly 
impacts  the  efficiency  of  the  program,  given  the  diversity 
of  the  system  architectures  and  sizes  that  are  available  as 
well  as  the  large  span  of  problems  at  hand.  Providing 
detailed  performance  (and  performance  density)  data  to 
users  aims  at  improving  this  mapping  by  swaying  the 
users’  choice  of  platforms  to  those  best-suited  for  their 
problems. 


5.  Systems  Used 

All  major  unclassified  systems  within  the  HPCMP 
were  used. 

6.  CTA 

The  computational  areas  covered  by  this  effort 
include  CCM,  CEA,  CFD,  CSM,  and  CWO. 

Table 2. Performance Per Processor Ranking

Table 3. Best and Worst Five Systems by Architecture (Performance).

* The entries in this table are susceptible to system size. Please supplement with data in Table 1.

[Table body not recoverable from the extracted text. Column headers: AERO (std); Cobalt60, GAMESS, HYCOM, NAMD, OOCore, and RFCTH (std and lg each); Overall (Synth, App, Score). Architecture labels appearing in the cells: NEW, O3900, P4-1.7, P4-1.3/1.7, SC40, SC40/45, X1, Xeon Cl, O3800, P3, T3E.]